Zookeeper enables coordination and synchronization with what it calls “znodes”, which are presented as a directory tree and resemble the file path names we would see in a Unix file system. Znodes can store data, but not much of it: currently less than 1 MB by default. The concept here is that Zookeeper stores znodes in memory and that these memory-based znodes enable quick client access for coordination, status, and other important functions needed by distributed applications like HBase. Zookeeper replicates (makes copies of) znodes across the ensemble, so if servers fail, the znode data will still be available as long as a majority quorum of servers is still up and running.
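To make the znode idea concrete, here is a minimal sketch using the standard ZooKeeper Java client. The zkhost:2181 address and the /demo path are assumptions made for illustration; they are not part of HBase.

import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ZnodeBasics {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) ensemble member; the watcher here simply ignores events.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> { });

        // Create a znode holding a small piece of coordination data (well under the ~1 MB limit).
        zk.create("/demo", "status=up".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read it back; the Stat object is filled with version and timestamp metadata.
        Stat stat = new Stat();
        byte[] data = zk.getData("/demo", false, stat);
        System.out.println(new String(data) + " (version " + stat.getVersion() + ")");

        zk.close();
    }
}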
The next crucial Zookeeper concept concerns how znode reads (versus writes) are managed. Every Zookeeper server, including the leader, can handle reads from a client; however, only the leader issues atomic znode writes, writes which either completely succeed or completely fail. When a znode write request arrives at the leader node, the leader broadcasts the write request to its follower nodes and then waits for a majority of followers to acknowledge the znode write. After the acknowledgement, the leader issues the znode write itself and then reports the successful completion status to the client.
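The fragment below is a deliberately simplified sketch of that write path, not ZooKeeper's actual ZAB implementation; the Follower interface and the proposeWrite method are made up for illustration of the majority-quorum idea.

import java.util.List;

public class LeaderWriteSketch {

    // Hypothetical stand-in for a follower server in the ensemble.
    interface Follower {
        boolean acknowledge(byte[] proposal);
    }

    static boolean proposeWrite(byte[] proposal, List<Follower> followers) {
        int acks = 1;                              // the leader counts itself
        for (Follower f : followers) {
            if (f.acknowledge(proposal)) {
                acks++;
            }
        }
        int ensembleSize = followers.size() + 1;
        if (acks > ensembleSize / 2) {             // majority quorum reached
            // ... apply the write locally, then report success to the client
            return true;
        }
        return false;                              // the write fails as a whole
    }
}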
Znodes enable some very powerful and flexible guarantees. When a Zookeeper client (like an HBase RegionServer) writes or reads a znode, the operation is atomic. It either completely succeeds or completely fails; there are no partial reads or writes. No other competing client can cause the read or write operation to fail. Furthermore, a znode has an access control list (ACL) associated with it for security purposes, carries version numbers and timestamps, and can notify clients (via watches) when it is modified.
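The version, timestamp, ACL, and watch features are all visible through the Java client. The sketch below reuses the hypothetical /demo znode and zkhost address from the earlier example.

import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ZnodeMetadataAndWatch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> { });

        // A one-shot watch: ZooKeeper calls back when /demo is changed or deleted.
        Watcher watch = event ->
            System.out.println("znode " + event.getPath() + " fired " + event.getType());

        // Reading fills a Stat with the znode's version and creation/modification timestamps.
        Stat stat = new Stat();
        byte[] data = zk.getData("/demo", watch, stat);
        System.out.println("version=" + stat.getVersion()
                + " ctime=" + stat.getCtime() + " mtime=" + stat.getMtime());

        // The ACL attached to the znode can be inspected as well.
        System.out.println("acl=" + zk.getACL("/demo", stat));

        zk.close();
    }
}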
As noted earlier, Zookeeper replicates znodes across the ensemble, so if a server fails, the znode data is still available as long as a majority quorum of servers is still up and running. This implies that writes to any znode from any Zookeeper server must be propagated across the ensemble; the Zookeeper leader handles this operation.
This znode write mechanism can cause followers to fall behind the leader for short periods. Zookeeper resolves this potential problem by providing a synchronization command. Clients that cannot tolerate this temporary lack of synchronization within the Zookeeper cluster may decide to issue a sync command before reading any znode.
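In the Java client this looks roughly like the following. The sync call is asynchronous, so the sketch waits on a latch before reading; the /demo path and zkhost address are again assumptions.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class SyncThenRead {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> { });
        CountDownLatch synced = new CountDownLatch(1);

        // sync() asks the server we are connected to to catch up with the leader
        // before we read; the callback fires when the flush completes.
        zk.sync("/demo", (rc, path, ctx) -> synced.countDown(), null);
        synced.await();

        byte[] data = zk.getData("/demo", false, new Stat());
        System.out.println(new String(data));

        zk.close();
    }
}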
In the znode world, we are going to come across what look like Unix-style pathnames. (Typically they begin with /hbase.) These pathnames, which are a subset of the znodes in the Zookeeper system created by HBase, are described in this list:
· master: Holds the name of the primary MasterServer.
· hbaseid: Holds the cluster’s ID.
· root-region-server: Points to the RegionServer holding the -ROOT- table.
· Something called /hbase/rs.
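If you want to see these znodes on a running cluster, you can list the children of /hbase with the Java client. The zkhost address below is an assumption, and the exact set of children varies with the HBase version.

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ListHBaseZnodes {
    public static void main(String[] args) throws Exception {
        // Point this at the ensemble that backs the HBase cluster.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> { });

        // List the HBase-created znodes under /hbase, such as master,
        // hbaseid, root-region-server, and rs.
        List<String> children = zk.getChildren("/hbase", false);
        for (String child : children) {
            System.out.println("/hbase/" + child);
        }

        zk.close();
    }
}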
So now we may wonder what’s up with this rather vaguely defined /hbase/rs. In the previous post (HBase Architecture: Master Server Part 4), we describe the various operations of the MasterServer and mention that Zookeeper notifies the MasterServer whenever a RegionServer fails. Now we take a closer look at how the process actually works in HBase, and we would be right to assume that it has something to do with /hbase/rs.
Zookeeper uses its watches mechanism to notify clients whenever a znode is created, accessed, or changed in some way. The MasterServer and the RegionServers are all Zookeeper clients and can leverage these znode watches. When a RegionServer comes online in the HBase system, it connects to the Zookeeper ensemble and creates its own unique ephemeral znode under the znode pathname /hbase/rs. At the same time, the Zookeeper system establishes a session with the RegionServer and monitors the session for events. If the RegionServer breaks the session for whatever reason (by failing to send a heartbeat ping, for example), the ephemeral znode that it created is deleted. Deleting the RegionServer’s child node under /hbase/rs causes the MasterServer to be notified so that it can initiate RegionServer failover. This notification is accomplished by way of a watch that the MasterServer sets up on the /hbase/rs znode.
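Here is a rough sketch of both sides of that arrangement, again with the standard ZooKeeper Java client. The ephemeral znode name is invented for illustration, and real HBase servers go through considerably more setup than this.

import java.util.List;
import org.apache.zookeeper.*;

public class RegionServerTracking {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> { });

        // What a RegionServer does, in spirit: create its own ephemeral child under /hbase/rs.
        // The znode name here is a made-up example; it disappears if the session is lost.
        zk.create("/hbase/rs/regionserver-1,60020,1234567890",
                  new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // What the MasterServer does, in spirit: watch the children of /hbase/rs.
        // When an ephemeral child vanishes, the watch fires and failover can begin.
        Watcher rsWatch = event ->
            System.out.println("children of /hbase/rs changed: " + event.getType());
        List<String> liveServers = zk.getChildren("/hbase/rs", rsWatch);
        System.out.println("live RegionServers: " + liveServers);

        zk.close();
    }
}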